================================
Analysis of Prosper Loan Data
My analysis is done on the Prosper loan data file. This is a relatively large data set with nearly 114,000 records. The data set has a many variables, but only a subset has been analyzed below.
Load Data
The Prosper data set contains 113,937 records and 81 variables.
Univariate Plots and initial analysis
Generate summary, head and str data outputs (turned off for final output)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554

Debt to Income ratio limited to 100% and below shows a long tail distribution but as expected must loans are on the lower end of the scale.

A log10 view of the previous plot shows a much more normal distribution of the data.
It would be interesting to compare Debt to Income Ratio with the ratio of monthly payment to monthly income. I will create that variable and plot it in the following step

Payment to Income Ratio has much more of a long tail distribution than debt to income ratio. Definitely an area to explore further in bivariate and multivariate sections
## Completed FinalPaymentInProgress Current
## 38074 205 56576
## Past Due (1-15 days) Past Due (16-30 days) Past Due (31-60 days)
## 806 265 363
## Past Due (61-90 days) Past Due (91-120 days) Past Due (>120 days)
## 313 304 16
## Defaulted Chargedoff NA
## 5018 11992 0
## NA's
## 5

Loan status counts. The bulk of the loan statuses are in 4 categories: Charged Off, Defaulted, Completed (loan paid off) and Current.
This looks like it may be a good area to focus on.
NOTE: LoanStatus was sorted into a logical order

The majority of the data is for smaller loans (<$20K). There are some spikes which appear at approximately 5,000 increments. We could surmise that a cusotmer used rounded figures in their loan requests, this would especially be true of personal loans and other types of loans which are not for a specific piece of collateral.

Loan payments follow a similar trend to the original loan amount as expected.

The term plot solidifies why loan payments and loan amounts are in alignment as there is littel variation in loan terms with nearly all of them being in the 3 year or 5 year category.

Credit score follows a fairly standard distribution. Data was limited to values of 300 to 990 since those are lowest and highest possible values supported by both FICO and Vantage scoring systems. VantageScoring was changed to an upper limit of 850 in 2013 but the data goes back to 2005.

It was somewhat expected that Borrower APR would vary a lot given the data spans 10 years. The large spike at .36 was not expected.

This was mostly informational. Debt consolidation (per the data dictionary found here: ) was the most common loan category. A category of 0 is listed as “other”

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 4.00 6.00 5.95 8.00 11.00 29084
The 2 plots above show a distribution of the Prosper Ratings for the loans. The alpha rating is the same as numeric with the expected exception that empty values are plotted on the alpha diagram. Per the Prosper website (https://www.prosper.com/plp/invest/prosper-ratings/), Prosper Rating is defined as:
The Prosper Rating is our proprietary system that allows us to maintain consistency when evaluating each loan application. Prosper Ratings allow investors to consider a loan’s level of risk because the rating represents an estimated average annualized loss rate range.
Note: ProsperRating..Alpha. was sorted appropriately
This is anothe area to be analzed further to see how Prosper Rating correlates to other loan data.
NOTE: there are over 29000 NA values for Prosper Ratings. This is due to prosper scores not being used until July 2009 per the Prosper website

This is is a comparison of Employment Status Duration with with a both a as is x scale and a log 10 view. Its not clear from the data if the long tail distribution is a result of more applicants being younger and new to the work force or simply the rsult of applicants changing careers
## Completed FinalPaymentInProgress Current
## 38074 205 56576
## Past Due (1-15 days) Past Due (16-30 days) Past Due (31-60 days)
## 806 265 363
## Past Due (61-90 days) Past Due (91-120 days) Past Due (>120 days)
## 313 304 16
## Defaulted Chargedoff NA
## 5018 11992 0
## NA's
## 5
Univariate Analsys Summary
Data Structure
The data contains a very large number of variables with data about each loan from Prosper Bank ranging from 2005 to 2014. There are 81 variables in nearly 114000 observations
Primary interest area?
I want to explore what factors contribute to the ability of a customer to stay current on their loan payments. Primarily this will mean looking at primary variables which are central to getting a loan and making a loan payment: The amount of the monthly pmt, the customers income and their credit rating. I will use this data to see how it relates to Loan Status and the Prosper Rating.
Features in the dataset which will help in the analysis
In the initial exploration of the data set, it appears that factors such as credit rating, loan rate, prosper rating and employment status amount will need further exploration.
New variables created
In the initial analysis, new variables were created for loan status as well as payment to income ratios. Since I am looking at loan status vs Prosper score, I created a few different variations of the loan status data to aid in the analysis. These are explained in the above code block
** loan.status.rollup ** [Converted to Factor] Paid Off = Completed + FinalPaymentInProgress In Default = Defaulted + ChargedOff (based on banking industry defs) Missed 1-2 Pmts = Past Due 1-15, 16-30, 31-60 Missed 3+ Pmts = Past Due 61 days and over Current = Current
** loan.status.num ** numeric value field for loan status
** loan.status.rollup ** sortable field for loan status
Unusual distributions
No significant chagnes to the data were made based on the distributions. As mentioned earlier, the loan status data was updated for plotting. Also noted that the prosper rating has 29,000 NA values.
Employment duration and Debt to Income ratio show a long tail distribution but both looked more reasonable using log10 scale.
Listing category was very skewed towards loan categories of Debt Consolidation and “other”. More research is needed here
Bivariate Plots and Analysis

##
## Pearson's product-moment correlation
##
## data: MonthlyLoanPayment and DebtToIncomeRatio
## t = 8.9592, df = 105380, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02155464 0.03362092
## sample estimates:
## cor
## 0.02758878
Monthly Loan Payment plotted against Debt To Income Ratio shows some correlation but it is not strong. The correlation calculation produces a small value which corresponds

##
## Pearson's product-moment correlation
##
## data: MonthlyLoanPayment and StatedMonthlyIncome
## t = 67.758, df = 113930, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1912280 0.2023915
## sample estimates:
## cor
## 0.1968162
The correlation between Monthly Loan Payment and Monthly Income is stronger. In future plots we can use the pmt.to.income.ratio calcuated variable to compare this data to other variables.

##
## Pearson's product-moment correlation
##
## data: pmt.to.income.ratio and DebtToIncomeRatio
## t = 230.48, df = 104490, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5765168 0.5845562
## sample estimates:
## cor
## 0.5805506
The Payment to Income Ratio has a pretty strong correlation to Debt to Income ratio. However, it should be noted that Debt to Income Ratio is calculated at the time of the loan and does not inclue the loan itself. I believe what this data is really telling us is that it is unlikely that loan applicans do not already have debt prior to the loan being sought.

The boxplot shows what is largely expected but the $0 income range data is suspect. Based on the median and 3rd quartile, it appears this may be situations where income was not recorded or was not given.

Scatter plot shows that higher Ratings are aligned with current loans and the rating trends down as the loan payment situation deteriorates. This downward trend was not as clear as I expected it to be.

The box plot output looks reasonable across the different rollups with the In Default category producing the lowest overal mean and median prosper score. This indicates that the Prosper rating is accurate. Interesting note is that the 3rd quartile is much lower for paid off loans than it is for current loans. Certainly this is brought down by the fact that many loans at the low end of the propser scale do pay off their loans.

We see a strong positive correlation between prosper ratings and credit scores

A negative correlation was expected between loan status and credit score but it is weaker than expected.

A boxplot view of credit score and Loan status clearly shows the lower credit score ranges for defaulted and charged off loans. Interestingly, the credit score range for completed loans has a lower 1st quartile than all other categories except Defaulted and Charged off. This may indicate that historically trend lower score customers paid off their loans more often, thus lowering the 1st quartile. The x axis labels are hard to read but they follow a best to worst order with best being “Completed” and worst being “Charged Off”

The Box Plot shows that median prosper scores are nearly identical for applicants who are employed full time. I am unsure what a status of “Other” is but based on the ranges presented, it appears to be similar to self-employed.
Bivariate Analysis summary
General Observations
The Prosper data elements don’t appear to show very strong linear relationships. While some areas are stronger than others, consistent patterns were somewhat difficult to find. This not too surpising given the nature of the data. There are many external factors which impact the status of loan. Employment status (after the loan is secured, economic factors, marital status, etc will all play a role in determining the ability to make loan payments.
Some areas which did stand out: - Monthly Income and Monthly loan payments and calculated ratio - Credit score has a fairly strong correlation with incomee, loan payment and the prosper rating. It also has correlation with loan status, but negative (as expected) - Loan status and Prosper rating were not as correlated as would have been expted given that the Prosper rating was established to provide risk score for each loan. This is where external forces likely play a role.
I noted that certain variables are almost certainly weighted more strongly in the algorithm which prodcues the prosper sccore. Credit Sccores, debt to income ratio and monthly income all appear to be contributing factors.
Observations on other features
Long Employment Duration is generally associated with lower payment to income ratios. The downward trend is not as strong as might have been expected.
Interesting note that self-employed customers have the same 1st quartile level for Prosper Rating as Retired and part-time customers but their credit score range is at the higher end
Revolving lines of credit data is pretty consistent across payment to income ratio.
Debt to Income Ratio vs. Loan Payment. There were comments on this above but I did not expect to see the volume of loans with a very high ratio
Strongest relationship
Credit scores were the strongest relationships in the data. Monthly Income, Monthly Loan Payment, Pmt to Income ration and ProsperRating all showed strong relationships withcredit scores.
Multivariate Plots and Analysis

Lower prosper ratings are correlated more closely to lower loan payments than higher prosper rating applications. Interesting to note that the low payment for low prosper ratings does not seem to change acorss income levels.

With the Income and Loan Payment plot by Loan stats we start seeing patterns emerge regarding the loan status categories. Defaulted / Charged off loans are more likely to have high loan pmts and lower incomes whereas paid off loans show an opposite trends. Will create a plot using pmt.to.income.ratio to see if this holds true.

While its a different look the trend seen previously does hold true. Defaulted loans are generally above 50% debt to income ratio and above 20% Payment to Income. We should also note that the Debt to Income ratio is prior to the loans in the file.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554

Plotting Income and Loan Payment gives a good view of how closely aligned the ratio of loan to income is with debt to income at the time of the loan. Higer debt to income ratios are clearly at the lower end of the y axis.
Debit to Income Ratio bucket was created for this plot
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 679.0 699.0 704.6 739.0 899.0 590

THis view shows that lower credit scores are more prevelant for loans which are in a default state.
Credit score range buckets were created for this plot.

The prosper rating plot shows a more clear picture of how credit range changes as propser rating changes. This implies that Prosper Rating uses Credit score as one of its primary inputs (The formula used for Prosper score is proprietary and I was unable to find exactly how it is calculated.

The facet view of pmt to income ratio by debit to income ratio by credit range for each Prosper rating shows more clearly how credit scores change by prosper rating

The dual plot shows a good comparison of how prosper rating differs from loan status across the same set of data. Both show simlar trends where higher payment to income ratios and debt to income ratios lead to low scores and / or defaulted loans.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20976 0.21883 0.28384 0.51229 25

A view of Borrower APR buckets by Credit sore and payment to income ratio. This is reflective of higher credit score applicants getting better loan rates. This holds true even as the ratio of payment to income increases.
Borrower APR buckets were created for this plot
Multivariate Analysis Summary
Talk about some of the relationships you observed in this part of the
One of the main areas I focused on was how Prosper Ratings treated debt to income ratios as well as as monthly payment to monthly income ratios. I aslo looked at similar plots using the loan status which is the reality of the status of each loan. Similar plots were also done with Credit ranges. Looking at dual plots of the same x and y data but colored by prosper and loan status were the most helpful for me in getting a view into the accuracy of Prosper Ratings.
Were there any interesting or surprising interactions between features?
It’s not surprising that defaulted loans tend to have higher debit to income and pmt to income ratios, but it was surprising to see the spike of high debt to income ratios for Paid off loans. It was also interesting to note the often considerable difference between what the loan status shows as the actual status of the loan and what the ProsperRating estimated the loan risk to be
OPTIONAL: Did you create any models with your dataset? Discuss the
Model not created.
Final Plots and Summary
Plot One

Description One
This plot demonstrates that customers who default on loans tend to have high Debt to Income and Payment to Income Ratios. The defaulting customers are represented in green and are predominantly above 15% Payment to Income Ratio or above 50% Debit to Income Ratio (in many cases both are true. As I explored this data, it was apparent that there were exceptions to everything and I go into this further in the the reflection section. Another interesting note is the low end of the Payment To Income Scale. These data is dominated by paid off loans, regardless of the Debit to Income Ratio. Even at 50% Debit to Income Ratio, customers with very low Payment to Income Ratio were paying off loans. THis is an indicator that Payment to Income Ratio could be a more important factor in predict the eventual loan outcome
Plot Two

Description Two
THis plot also looks at the Payment to Income and Debit to Income Ratios, but breaks it down by Credit Scores and creates Facets based on the Prosper Rating. It clearly shows that Credit Scores are weighted significantly in the creation of Prosper Ratings as is Debit to Income Ratios. Each succeeding facet shows a lowering of the overall credit rating and an extension of the how high the Debt to Income ratio is. We also see a slight increase in Payment to Incomee Ratio for each facet but it does not steadily change. While I could not find data to prove it stands to reason that the banking institution woudl deny loans which are risky due to lack of adequate income.
Plot Three

Description Three
The final plot illustrates some possible deficiencies in the Propser Rating system. The first plot shows a trend of higher Debt to Income Ratio as the Prosper Rating declines – we touched on this in previous plots – and the worst ratings (E and HR) dominate the plot for the for very high ratios (roughly 55% and higher). However, we do not see this clear progression in the Loan Status Plot (plot 2). In the analysis process it was determined that is not truly accurate to include “current” loans when comparing Prosper Ratings to Loan Status. Prosper attempts to assign a risk factor to a loan and you would not be able to tell if the rating accurately reflected the risk until the loan closes (through pay off or default). This is not to say that high risk automatically means a defaulted loan but we would expect that defaulted loans to be more heavily represented on the high end of Prosper Rating scale. The third plot excludes the cu rrent loans and here we see more clearly that defaulted loans do not follow the trends seen in the Ratings plot. They are actually distributed pretty evenly across the data points.
Reflection
The Prosper data set presented a challenge due to sheer number of variables that were available to explore. I focused on the more common elements that you would expect for loan data and analyzed how those variables correlated with the Prosper Rating system as well as the acdtual loan status. I was able get a pretty strong idea of what variables are considered by the Prosper Rating system (The Faceted plot in the final section is a great example of that) and also got insight into what factors impacted the loan status – Debt To Income Ratio & Credit Score of the applicant
One of the major struggles was finding strong correlation between different variables. There was definite correlation found but rarely as strong as what I expected to find. It became apparent as the analysis went on that external factors not reprsented in the data play a significant role in the outcome of the the loan. Changes in economy, marital status changes, employment status changes, medical conditions and many other factors which occur after a loan is taken out will impact a persons ability to make payments. Ironically, many of those same factors may cause a person to see a loan in the first place.
Future work on this data set to would be to explore the other variables which are included in the data set to see how they might impact a loan’s status. I also believe research into external factors is necessary to truly understand what impacts the outcome. Are extenuating circumstances more common amongst customers with worse prosper ratings? What factors cause higher rated and higer income customers to default. Knowing why payments are missed and why loans are defaulted would be very beneficial to predicting an outcome. This would certainly mean the analysis of significant amounts of data.